ABSTRACT

Nonnegative Matrix Factorization (NMF) has been continuously
evolving in areas such as pattern recognition and information
retrieval. It factorizes a matrix into a product of two low-rank
nonnegative matrices that define a parts-based, linear
representation of nonnegative data. Recently, Graph regularized NMF
(GrNMF) was proposed to find a compact representation that uncovers
the hidden semantics while respecting the intrinsic geometric
structure of the data. In GrNMF, an affinity graph is constructed
from the original data space to encode the geometrical information.
In this paper, we propose a novel idea that uses Multiple Kernel
Learning to refine the graph structure so that it reflects both the
factorization of the matrix and the new data space. GrNMF is
improved by utilizing the graph refined by the kernel learning, and
a novel kernel learning method is in turn introduced under the
GrNMF framework. The proposed algorithm shows encouraging results
in comparison with state-of-the-art clustering algorithms such as
NMF, GrNMF, and SVD.

KEY WORDS 

Data Representation, Nonnegative Matrix Factorization,
Graph Regularization, Multiple Kernel Learning.

1 Introduction 

Nonnegative matrix factorization (NMF) has been introduced as a
matrix factorization technique that produces a useful decomposition
in the analysis of data. NMF decomposes the data as a product of
two matrices that are constrained to have nonnegative elements.
This method results in a reduced representation of the original
data that can be seen either as a feature extraction or as a
dimensionality reduction technique, and it has become popular in
recent years for data representation in bioinformatics, medical
imaging, pattern recognition and information retrieval. Recently,
Cai et al. extended the traditional NMF to Graph regularized
Nonnegative Matrix Factorization (GrNMF) [4]. The basic idea is
that the data are sampled from a probability distribution that has
support on or near a sub-manifold of the ambient space. One then
hopes to find a compact representation that uncovers the hidden
semantics and simultaneously respects the intrinsic geometric
structure. In GrNMF, the geometrical information of the data space
is encoded by constructing a nearest neighbor graph, and the matrix
factorization is then sought respecting the graph structure.

The key component of GrNMF is the graph. In the original GrNMF
algorithm, the graph is constructed according to the original input
feature space. The nearest neighbors of a data point are found by
comparing the Euclidean distances [23] between pairs of data
points, and the weights of the edges are also estimated in the
Euclidean space. However, it is well known that in some data
clustering and classification problems, using the original linear
feature space directly is not appropriate, because in many
applications the distribution of the original data is nonlinear,
which causes problems for the graph construction in GrNMF. We can
solve this problem by mapping the input data into a nonlinear
feature space, where the mapping is represented implicitly by a
kernel. The graph should then be constructed according to the new
nonlinear data space represented by the kernel. In practice, the
types and the parameters of the kernels must be selected.
Unfortunately, the most suitable kernel for a particular task is
often unknown in advance. Moreover, an exhaustive search over a
user-defined pool of kernels becomes quite time-consuming when the
pool is large. Recently, multiple kernel learning methods
[20, 8, 7] have shown the need to consider multiple kernels, or a
combination of kernels, rather than a single fixed kernel for data
representation. In fact, multiple kernel learning, graph
construction and NMF have each been extensively studied for data
representation in the literature, but they have never been
investigated in a unified framework, so the inherent relationship
among them has been neglected.

In this paper, we investigate the inherent relationship between
multiple kernel learning and NMF with graph regularization.
Multiple kernel learning provides a new data space for the graph
construction of GrNMF, and GrNMF in turn provides the criterion for
feature selection and multiple kernel learning. We unify multiple
kernel learning and GrNMF within a single objective function and
optimize them alternately, so that each affects the learning of the
other. We thus propose a unified multiple kernel learning and graph
regularization framework for NMF, referred to as Adaptive Graph
regularized NMF with Multiple Kernel (AdapGrNMF_MultiK), for data
representation in clustering and classification tasks. The main
contributions of this paper include:

1. We propose a unified framework for multiple kernel
learning with new matrix factorization objective
functions, and incorporate the graph structure into it.

2. GrNMF is improved by utilizing the graph adaptive to 
the new data space refined by multiple kernel learning. 

3. A novel kernel learning method is proposed under the 
framework of GrNMF. 

The rest of the paper is organized as follows. We briefly review
GrNMF in Section 2. In Section 3 we present the AdapGrNMF_MultiK
algorithm to tackle the multiple kernel learning problem for GrNMF.
In Section 4 we experimentally compare the proposed method with
other NMF learning methods on two data sets for clustering and
classification tasks. Finally, concluding remarks and future work
are presented in Section 5.

2 Overview of Graph Regularized NMF 



2.1 Nonnegative Matrix Factorization 



Given N data points $X = \{x_1, \cdots, x_N\} \subset \mathbb{R}^{D}$,
represented as a data matrix $X = [x_1, \cdots, x_N] \in
\mathbb{R}^{D \times N}$, we consider factorizations of the form:

$$
X \approx HW \tag{1}
$$

where $X \in \mathbb{R}^{D \times N}$, $H \in \mathbb{R}^{D \times R}$,
and $W \in \mathbb{R}^{R \times N}$. Commonly, we have $R \ll D$ and
$R \ll N$. NMF can be written in this form, where the data matrix X
is assumed to be nonnegative, as are the factors H and W [6].

NMF aims to find two nonnegative matrices H and W whose product
approximates the original matrix X well, as in (1). Each data
vector $x_n$ is approximated by a linear combination of the columns
$h_r$ of H, weighted by the components of the n-th column of W, as

$$
x_n \approx \sum_{r=1}^{R} h_r w_{rn} = H w_n \tag{2}
$$

Therefore, H can be regarded as containing a set of basis vectors.
Let $w_n = [w_{1n}, \cdots, w_{Rn}]^T$ denote the n-th column of W.
$w_n$ can be regarded as the coding vector, or a new representation,
of the n-th data point with respect to the basis H.

The most commonly used cost function is the squared Euclidean
distance between the two matrices (the square of the Frobenius norm
of their difference):

$$
\begin{aligned}
O^{NMF}(H, W) &= \|X - HW\|^2 \\
&= \mathrm{Tr}(XX^T) - 2\,\mathrm{Tr}(XW^TH^T) + \mathrm{Tr}(HWW^TH^T)
\end{aligned}
\tag{3}
$$

where $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix. The above
objective function can be minimized by the iterative update
algorithm proposed by Lee and Seung [16].

2.2 Graph regularized NMF

By performing this learning in the Euclidean space, NMF fails to
discover the intrinsic geometrical and discriminating structure of
the data space [4]. To avoid this limitation, Cai et al. [4]
introduced the Graph regularized NMF (GrNMF) algorithm, which
incorporates a geometrically based regularizer.

In [4], the Local Invariance Assumption (LIA) was imposed on NMF:
if two data points $x_n$ and $x_m$ are close in the intrinsic
geometry of the data distribution, then $w_n$ and $w_m$, the coding
vectors of these two points with respect to the new basis, are also
close to each other, and vice versa. Cai et al. modeled the local
geometric structure by a P nearest neighbor graph $\mathcal{G}$ on a
scatter of data points.

For each data point $x_n \in X$, its P nearest neighbors
$\mathcal{N}_n$ in X can be determined via the squared Euclidean
distance metric [23] as

$$
d(x_n, x_m) = \|x_n - x_m\|^2 \tag{4}
$$

A P nearest neighbor graph is constructed for X as $\mathcal{G} =
\{\mathcal{V}, \mathcal{E}, A\}$. The node set $\mathcal{V}$
corresponds to the N data points. $\mathcal{E}$ is the edge set, and
$(x_n, x_m) \in \mathcal{E}$ if $x_m \in \mathcal{N}_n$. $A \in
\mathbb{R}^{N \times N}$ is the weight matrix on the graph, with
$A_{nm}$ equal to the weight of edge $(x_n, x_m)$. There are many
choices to define the weight matrix A. Two of the most commonly used
are as follows:

0-1 Weighting

$$
A_{nm} =
\begin{cases}
1, & \text{if } (x_n, x_m) \in \mathcal{E}, \\
0, & \text{else.}
\end{cases}
\tag{5}
$$

Dot-Product Weighting

$$
A_{nm} =
\begin{cases}
x_n^T x_m, & \text{if } (x_n, x_m) \in \mathcal{E}, \\
0, & \text{else.}
\end{cases}
\tag{6}
$$



With the weight matrix A defined above, we can use the following
graph regularization term to measure the smoothness of the
low-dimensional coding vector representations in W:

$$
\begin{aligned}
O^{Gr}(W; A) &= \frac{1}{2} \sum_{n,m=1}^{N} \|w_n - w_m\|^2 A_{nm} \\
&= \mathrm{Tr}(WDW^T) - \mathrm{Tr}(WAW^T) = \mathrm{Tr}(WLW^T)
\end{aligned}
\tag{7}
$$

where D is a diagonal matrix whose entries are the column sums of A,
$D_{nn} = \sum_{m=1}^{N} A_{nm}$, and $L = D - A$ is the graph
Laplacian.
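
To make the graph construction in (4)-(6) and the identity in (7) concrete, the following is a minimal NumPy sketch. The function names `knn_graph` and `laplacian` are hypothetical, the data are random, and the edge set is symmetrized (an assumption not stated above, but needed for the trace identity in (7) to hold exactly).

```python
import numpy as np

def knn_graph(X, P, weighting="binary"):
    """Symmetric P-nearest-neighbor weight matrix for the columns of X (D x N)."""
    N = X.shape[1]
    G = X.T @ X
    sq = np.diag(G)
    dist = sq[:, None] + sq[None, :] - 2.0 * G      # squared Euclidean distance, eq. (4)
    np.fill_diagonal(dist, np.inf)                  # a point is not its own neighbor
    nbrs = np.argsort(dist, axis=1)[:, :P]          # P nearest neighbors of each point
    A = np.zeros((N, N))
    A[np.repeat(np.arange(N), P), nbrs.ravel()] = 1.0   # 0-1 weighting, eq. (5)
    A = np.maximum(A, A.T)                          # assumption: symmetrize the edge set
    if weighting == "dot":
        A = A * G                                   # dot-product weighting, eq. (6)
    return A

def laplacian(A):
    """Degree matrix D and graph Laplacian L = D - A."""
    D = np.diag(A.sum(axis=1))
    return D, D - A

rng = np.random.default_rng(0)
X = rng.random((5, 12))      # D = 5 features, N = 12 points
W = rng.random((3, 12))      # R = 3 coding vectors, one column per point
A = knn_graph(X, P=3)
D, L = laplacian(A)

# Left-hand side of (7): half the weighted sum of squared coding distances.
diff = W[:, :, None] - W[:, None, :]                # shape R x N x N
pairwise = 0.5 * np.sum(np.sum(diff ** 2, axis=0) * A)
# Right-hand side of (7): the trace form with the Laplacian.
trace_form = np.trace(W @ L @ W.T)
```

For a symmetric A, `pairwise` and `trace_form` agree up to floating-point rounding, which is the equality stated in (7).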

By minimizing $O^{Gr}(W; A)$ with respect to W, we expect that if
two data points $x_n$ and $x_m$ are close (i.e., $A_{nm}$ is large),
then $w_n$ and $w_m$ are also close to each other. This objective
function is similar to the one used in LPP, in which it is assumed
that the low-dimensional coordinates share the same linear
construction weights with the high-dimensional coordinates. In
contrast, we assume that the sharing relation exists between the
coding coefficient space of NMF and the feature space.

Combining this geometrically based regularizer $O^{Gr}(W; A)$ with
the original NMF objective function $O^{NMF}(H, W)$ leads to the
loss function of GrNMF [4]:

$$
\begin{aligned}
O^{GrNMF}(H, W; A) &= O^{NMF}(H, W) + \alpha\, O^{Gr}(W; A) \\
&= \mathrm{Tr}(XX^T) - 2\,\mathrm{Tr}(XW^TH^T) \\
&\quad + \mathrm{Tr}(HWW^TH^T) + \alpha\, \mathrm{Tr}(WLW^T)
\end{aligned}
\tag{8}
$$

in which $\alpha$ is the tradeoff parameter that balances the two
terms. Thus the GrNMF problem becomes the minimization problem

$$
\min_{H, W}\; O^{GrNMF}(H, W; A)
\quad \text{subject to } H \geq 0,\; W \geq 0.
\tag{9}
$$

where H and W can be solved in an iterative way by updating them
alternately [4].
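
The alternating scheme for (9) can be sketched with multiplicative updates of the form used for graph-regularized NMF, written here in the $X \approx HW$ convention of this paper. This is an illustrative sketch under stated assumptions, not the reference implementation: the function names, the random initialization, the small constant `eps` guarding the divisions, and the ring graph used as a stand-in for a nearest neighbor graph are all hypothetical.

```python
import numpy as np

def grnmf(X, A, R, alpha=1.0, n_iter=200, eps=1e-10, seed=0):
    """Multiplicative updates for min ||X - HW||^2 + alpha Tr(W L W^T), H, W >= 0."""
    rng = np.random.default_rng(seed)
    n_features, n_points = X.shape
    H = rng.random((n_features, R))
    W = rng.random((R, n_points))
    Dg = np.diag(A.sum(axis=1))                     # degree matrix of the graph
    for _ in range(n_iter):
        H *= (X @ W.T) / (H @ W @ W.T + eps)
        W *= (H.T @ X + alpha * W @ A) / (H.T @ H @ W + alpha * W @ Dg + eps)
    return H, W

def grnmf_objective(X, H, W, A, alpha):
    """Loss (8): reconstruction error plus graph smoothness penalty."""
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.norm(X - H @ W) ** 2 + alpha * np.trace(W @ L @ W.T)

rng = np.random.default_rng(1)
N = 20
X = rng.random((8, N))
# Stand-in graph (assumption): a symmetric ring linking each point to its neighbors.
A = np.zeros((N, N))
idx = np.arange(N)
A[idx, (idx + 1) % N] = 1.0
A = np.maximum(A, A.T)

# Same seed, so both runs follow the same trajectory from the same start.
H1, W1 = grnmf(X, A, R=4, alpha=0.5, n_iter=1)
H2, W2 = grnmf(X, A, R=4, alpha=0.5, n_iter=200)
obj_early = grnmf_objective(X, H1, W1, A, 0.5)
obj_late = grnmf_objective(X, H2, W2, A, 0.5)
```

Because every factor in the update ratios is nonnegative, H and W stay nonnegative throughout. Setting `alpha=0` reduces the updates to the plain NMF updates for (3).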

3 Adaptive Graph Regularized NMF with
Multi-Kernel Learning

In this section, we attempt to obtain an appropriate data
representation for GrNMF in the Hilbert space of kernel methods.
Accordingly, multiple kernel learning is considered. Moreover, we
reconstruct the graph so that it adapts to the new data
representation, re-regularize the NMF with it, and then re-estimate
the kernel coefficients. This results in a novel iterative data
representation algorithm that regularizes NMF by an adaptive graph,
AdapGrNMF_MultiK.

3.1 Multiple Kernel Learning for NMF 

Consider a nonlinear mapping $x_n \rightarrow \phi(x_n)$, or $X
\rightarrow \phi(X) = [\phi(x_1), \cdots, \phi(x_N)]$. Then the
kernel matrix $K \in \mathbb{R}^{N \times N}$ is given by $K =
\phi(X)^T \phi(X)$. A direct application of NMF to the feature
matrix $\phi(X)$ yields

$$
\phi(X) \approx HW \tag{10}
$$

However, in NMF there are no constraints on the basis vectors $H =
[h_1, \cdots, h_R]$. For reasons of interpretability it may be
useful to impose the constraint that the vectors defining H lie
within the column space of $\phi(X)$: $h_r = \sum_{n=1}^{N} f_{nr}
\phi(x_n)$, or

$$
H = \phi(X) F \tag{11}
$$

where $f_{nr}$ is the (n, r)-th element of the matrix $F \in
\mathbb{R}^{N \times R}$, $F \geq 0$. Substituting (11) into (10),
we have the objective function for the kernelized version of NMF

$$
\begin{aligned}
O^{NMF_K}(F, W) &= \|\phi(X) - \phi(X)FW\|^2 \\
&= \mathrm{Tr}\left[(\phi(X) - \phi(X)FW)(\phi(X) - \phi(X)FW)^T\right] \\
&= \mathrm{Tr}\left[\phi(X)(I - FW)(I - FW)^T \phi(X)^T\right] \\
&= \mathrm{Tr}\left[\phi(X)^T \phi(X)(I - FW)(I - FW)^T\right] \\
&= \mathrm{Tr}\left[K(I - FW)(I - FW)^T\right]
\end{aligned}
\tag{12}
$$
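
The chain of equalities in (12) can be checked numerically in the one case where the feature map is available explicitly: the linear kernel, for which $\phi(X) = X$. The sketch below uses random matrices, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, R = 6, 10, 3
X = rng.random((D, N))       # with a linear kernel, phi(X) is X itself
F = rng.random((N, R))       # basis coefficients, H = phi(X) F
W = rng.random((R, N))       # coding matrix

K = X.T @ X                  # N x N kernel matrix
Z = np.eye(N) - F @ W

# First line of (12): reconstruction error computed in feature space.
explicit = np.linalg.norm(X - X @ F @ W, "fro") ** 2
# Last line of (12): the same quantity using only the kernel matrix.
kernel_form = np.trace(K @ Z @ Z.T)
```

The two values agree up to rounding, which is what makes the objective usable with kernels whose feature map is never formed.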

Suppose there are altogether L different kernel functions
$\{K_l\}_{l=1}^{L}$ available for the NMF task at hand.
Accordingly, there are L different associated nonlinear feature
spaces. In general, we do not know which kernel space should be
used. An intuitive way is to use them all by concatenating all
feature spaces into an augmented Hilbert space, and to associate
with each feature space a relevance weight $\tau_l$, with $\tau_l
\geq 0$ and $\sum_{l=1}^{L} \tau_l = 1$. We denote the kernel
weights as a vector $\tau = [\tau_1, \cdots, \tau_L]^T$. Performing
NMF in such a feature space is equivalent to employing a combined
kernel function for the NMF:

$$
K_\tau = \sum_{l=1}^{L} \tau_l K_l \tag{13}
$$

Substituting this relation into (12), we obtain the objective
function for Multiple Kernel based NMF (NMF_MultiK):

$$
O^{NMF_{MultiK}}(F, W, \tau) = \mathrm{Tr}\left[\sum_{l=1}^{L} \tau_l K_l (I - FW)(I - FW)^T\right]
\tag{14}
$$
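
As a small illustration of (13) and (14), the sketch below combines a stack of base kernel matrices with simplex weights and evaluates the multiple kernel objective. The helper names `combine_kernels` and `multik_objective`, the three example kernels, and the weights are assumptions chosen for the example.

```python
import numpy as np

def combine_kernels(kernels, tau):
    """K_tau = sum_l tau_l K_l, eq. (13); kernels has shape (L, N, N)."""
    tau = np.asarray(tau, dtype=float)
    if np.any(tau < 0) or not np.isclose(tau.sum(), 1.0):
        raise ValueError("tau must be nonnegative and sum to one")
    return np.tensordot(tau, kernels, axes=1)

def multik_objective(kernels, tau, F, W):
    """Objective (14) for given weights tau and factors F, W."""
    Z = np.eye(F.shape[0]) - F @ W
    return np.trace(combine_kernels(kernels, tau) @ Z @ Z.T)

rng = np.random.default_rng(0)
X = rng.random((5, 9))
G = X.T @ X
sq = np.diag(G)
kernels = np.stack([
    G,                                                        # linear kernel
    (1.0 + G) ** 2,                                           # polynomial, degree 2
    np.exp(-(sq[:, None] + sq[None, :] - 2.0 * G) / 2.0),     # RBF with sigma = 1
])
tau = np.array([0.2, 0.3, 0.5])
F = rng.random((9, 3))
W = rng.random((3, 9))

total = multik_objective(kernels, tau, F, W)
# Per-kernel costs: the objective is linear in tau.
Z = np.eye(9) - F @ W
s = np.array([np.trace(K @ Z @ Z.T) for K in kernels])
```

Here `total` equals `tau @ s` up to rounding, so for fixed F and W the objective (14) is a linear function of the kernel weights.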



3.2 Graph Adaptive to Multiple Kernel Learning 

To update the graph $\mathcal{G}$ with respect to the multiple
kernel space, given $\tau$, the P nearest neighbors
$\mathcal{N}_n^\tau$ for the GrNMF algorithm are re-found using the
$\tau$-weighted squared Euclidean distance in the multiple kernel
space, i.e.,

$$
\begin{aligned}
d_\tau(x_n, x_m) &= \|\phi(x_n) - \phi(x_m)\|^2 \\
&= \phi(x_n)^T\phi(x_n) + \phi(x_m)^T\phi(x_m) - 2\phi(x_n)^T\phi(x_m) \\
&= K_\tau(x_n, x_n) + K_\tau(x_m, x_m) - 2K_\tau(x_n, x_m) \\
&= \sum_{l=1}^{L} \tau_l \left[K_l(x_n, x_n) + K_l(x_m, x_m) - 2K_l(x_n, x_m)\right]
\end{aligned}
\tag{15}
$$

The corresponding P nearest neighbor graph adaptive to $\tau$ is
denoted as $\mathcal{G}_\tau = \{\mathcal{V}, \mathcal{E}_\tau,
A_\tau\}$. Here we discuss the updating of 0-1 weighting and
dot-product weighting for the weight matrix $A_\tau$ of the
adaptive graph with multiple kernel learning. 0-1 weighting is
simply updated as $A^\tau_{nm} = 1$ if $(n, m) \in
\mathcal{E}_\tau$, and 0 otherwise. For dot-product weighting,
$A^\tau_{nm} = \phi(x_n)^T\phi(x_m) = K_\tau(x_n, x_m) =
\sum_{l=1}^{L} \tau_l K_l(x_n, x_m)$ if $(n, m) \in
\mathcal{E}_\tau$, and 0 otherwise.
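
The kernel-space distance (15) needs only entries of the combined kernel matrix, so the adaptive graph can be rebuilt without forming the feature map. A minimal sketch follows, assuming a precomputed matrix `K_tau`. The function names are hypothetical, and the symmetrization of the edge set is an added assumption.

```python
import numpy as np

def kernel_sq_dist(K):
    """d(x_n, x_m) = K_nn + K_mm - 2 K_nm for all pairs, as in eq. (15)."""
    d = np.diag(K)
    return d[:, None] + d[None, :] - 2.0 * K

def adaptive_graph(K_tau, P, weighting="binary"):
    """P-nearest-neighbor weight matrix built from kernel-space distances."""
    N = K_tau.shape[0]
    dist = kernel_sq_dist(K_tau)
    np.fill_diagonal(dist, np.inf)                  # exclude self-neighbors
    nbrs = np.argsort(dist, axis=1)[:, :P]
    A = np.zeros((N, N))
    A[np.repeat(np.arange(N), P), nbrs.ravel()] = 1.0   # 0-1 weighting
    A = np.maximum(A, A.T)                          # assumption: symmetrized edges
    if weighting == "dot":
        A = A * K_tau                               # dot-product weighting in kernel space
    return A

rng = np.random.default_rng(0)
X = rng.random((4, 10))
K_lin = X.T @ X                                     # linear kernel as a sanity check

# With a linear kernel the kernel distance is the ordinary squared distance.
explicit = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)
from_kernel = kernel_sq_dist(K_lin)
A_tau = adaptive_graph(K_lin, P=3, weighting="dot")
```

With the linear kernel, `from_kernel` matches the explicit pairwise distances, so the adaptive graph reduces to the original-space graph of Section 2.2 in that special case.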

With the graph $\mathcal{G}_\tau$ adaptive to the multiple kernel
space, we then re-regularize NMF_MultiK in the multiple kernel
space. Similar to GrNMF, we propose the adaptive graph
regularization term as

$$
\begin{aligned}
O^{AdapGr}(W; A_\tau) &= \frac{1}{2} \sum_{n,m=1}^{N} \|w_n - w_m\|^2 A^\tau_{nm} \\
&= \mathrm{Tr}(W L_\tau W^T)
\end{aligned}
\tag{16}
$$

where $L_\tau = D_\tau - A_\tau$ is the corresponding graph
Laplacian.

By minimizing $O^{AdapGr}(W; A_\tau)$, we expect that if two data
points $\phi(x_n)$ and $\phi(x_m)$ are close with respect to the new
kernel determined by $\tau$, then the representations $w_n$ and
$w_m$ of these two points with respect to the new basis $H =
\phi(X)F$ are also close to each other.

3.3 AdapGrNMF Algorithm with Multiple Kernel 
Learning 

To perform the multiple kernel representation together with the
adaptive graph regularized NMF, we first propose a unified
AdapGrNMF and multiple kernel learning objective function for data
representation, which combines the loss function of NMF with
multiple kernel learning and the adaptive graph regularization
term. We then develop an alternating update algorithm to estimate
the basis matrix $H = \phi(X)F$, the coefficient matrix W and the
kernel weight vector $\tau$, as follows.

3.3.1 Objective function

Combining the adaptive graph-based regularizer defined in (16) with
the NMF objective function with multiple kernels defined in (14)
leads to the optimization problem of our AdapGrNMF with Multiple
Kernel learning, AdapGrNMF_MultiK:

$$
\begin{aligned}
\min_{F, W, \tau}\; & O^{AdapGrNMF_{MultiK}}(F, W, \tau) \\
&= O^{NMF_{MultiK}}(F, W, \tau) + \alpha\, O^{AdapGr}(W; A_\tau) \\
&= \mathrm{Tr}\left[\sum_{l=1}^{L} \tau_l K_l (I - FW)(I - FW)^T\right] + \alpha\, \mathrm{Tr}(W L_\tau W^T) \\
&= \mathrm{Tr}\left[K_\tau (I - FW)(I - FW)^T\right] + \alpha\, \mathrm{Tr}(W L_\tau W^T) \\
\text{s.t. } & F \geq 0,\; W \geq 0,\; \tau \geq 0,\; \sum_{l=1}^{L} \tau_l = 1.
\end{aligned}
\tag{17}
$$

3.3.2 Optimization



Since direct optimization of (17) is difficult, we instead adopt an
iterative, two-step strategy to alternately optimize (F, W) and
$\tau$. At each iteration, one of (F, W) and $\tau$ is optimized
while the other is fixed, and then the roles of (F, W) and $\tau$
are switched. Iterations are repeated until convergence or until a
maximum number of iterations is reached.

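• On optimizing (F, W): By fixing $\tau$ and updating the
  adaptive graph $\mathcal{G}_\tau$ and kernel matrix $K_\tau$, the
  optimization problem (17) is reduced to

$$
\begin{aligned}
\min_{F, W}\; & \mathrm{Tr}\left[K_\tau (I - FW)(I - FW)^T\right] + \alpha\, \mathrm{Tr}(W L_\tau W^T) \\
\text{s.t. } & F \geq 0,\; W \geq 0.
\end{aligned}
\tag{18}
$$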



  The Lagrangian $\mathcal{L}$ of the above optimization problem is

$$
\begin{aligned}
\mathcal{L} &= \mathrm{Tr}(K_\tau) - 2\,\mathrm{Tr}(K_\tau W^T F^T) \\
&\quad + \mathrm{Tr}(K_\tau F W W^T F^T) + \alpha\, \mathrm{Tr}(W L_\tau W^T) \\
&\quad + \mathrm{Tr}(\Phi F^T) + \mathrm{Tr}(\Psi W^T)
\end{aligned}
\tag{19}
$$

  where $\Phi = [\phi_{nr}]$ and $\Psi = [\psi_{rn}]$ are the
  Lagrange multiplier matrices for the constraints $F \geq 0$ and
  $W \geq 0$, respectively. As mentioned before, it is often
  difficult to find a closed form for the minimizer of
  $O^{AdapGrNMF_{MultiK}}(F, W, \tau)$. Such difficulties often
  arise when one wishes to maximize or minimize a function subject
  to fixed outside conditions or constraints. Introducing $\Phi$
  and $\Psi$ as Lagrange multipliers allows this class of problems
  to be solved without explicitly solving the constraints and using
  them to eliminate extra variables.

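  By setting the partial derivatives of $\mathcal{L}$ with respect
  to F and W to zero, we have

$$
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial F} &= -2K_\tau W^T + 2K_\tau F W W^T + \Phi = 0 \\
\frac{\partial \mathcal{L}}{\partial W} &= -2F^T K_\tau + 2F^T K_\tau F W + 2\alpha W L_\tau + \Psi = 0
\end{aligned}
\tag{20}
$$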



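  Using the KKT conditions $\phi_{nr} f_{nr} = 0$ and $\psi_{rn}
  w_{rn} = 0$, we get the following equations for $f_{nr}$ and
  $w_{rn}$:

$$
\begin{aligned}
(K_\tau F W W^T)_{nr} f_{nr} - (K_\tau W^T)_{nr} f_{nr} &= 0 \\
(F^T K_\tau F W + \alpha W D_\tau)_{rn} w_{rn} - (F^T K_\tau + \alpha W A_\tau)_{rn} w_{rn} &= 0
\end{aligned}
\tag{21}
$$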

  The KKT (Karush-Kuhn-Tucker) conditions are satisfied at the
  minimum (F, W) of the given constrained optimization problem,
  provided that the intersection of the set of feasible directions
  with the set of descent directions coincides with the
  intersection of the set of feasible directions for the linearized
  constraints with the set of descent directions. This rather
  technical regularity assumption holds here, since the
  nonnegativity constraints are linear. For convex problems (if the
  regularity condition holds), the KKT conditions are necessary and
  sufficient to find a solution [3].



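  These equations lead to the following updating rules:

$$
\begin{aligned}
f_{nr} &\leftarrow f_{nr}\, \frac{(K_\tau W^T)_{nr}}{(K_\tau F W W^T)_{nr}} \\
w_{rn} &\leftarrow w_{rn}\, \frac{(F^T K_\tau + \alpha W A_\tau)_{rn}}{(F^T K_\tau F W + \alpha W D_\tau)_{rn}}
\end{aligned}
\tag{22}
$$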
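
The updating rules (22) translate directly into element-wise array operations. The following sketch assumes a kernel matrix with nonnegative entries (an RBF kernel here), a fixed symmetric graph, and a small constant `eps` added to the denominators to avoid division by zero. The function names and the ring graph are illustrative only.

```python
import numpy as np

def update_F_W(K, A, F, W, alpha, eps=1e-10):
    """One pass of the multiplicative rules (22) for fixed kernel K and graph A."""
    Dg = np.diag(A.sum(axis=1))
    F = F * (K @ W.T) / (K @ F @ W @ W.T + eps)
    W = W * (F.T @ K + alpha * W @ A) / (F.T @ K @ F @ W + alpha * W @ Dg + eps)
    return F, W

def objective(K, A, F, W, alpha):
    """Objective (18): kernel reconstruction term plus graph regularizer."""
    L = np.diag(A.sum(axis=1)) - A
    Z = np.eye(K.shape[0]) - F @ W
    return np.trace(K @ Z @ Z.T) + alpha * np.trace(W @ L @ W.T)

rng = np.random.default_rng(0)
N, R, alpha = 15, 3, 0.5
X = rng.random((4, N))
sq = np.sum(X ** 2, axis=0)
K = np.exp(-(sq[:, None] + sq[None, :] - 2.0 * X.T @ X) / 2.0)   # RBF, sigma = 1

# Stand-in graph (assumption): symmetric ring over the N points.
A = np.zeros((N, N))
idx = np.arange(N)
A[idx, (idx + 1) % N] = 1.0
A = np.maximum(A, A.T)

F = rng.random((N, R))
W = rng.random((R, N))
obj_start = objective(K, A, F, W, alpha)
for _ in range(100):
    F, W = update_F_W(K, A, F, W, alpha)
obj_end = objective(K, A, F, W, alpha)
```

Since each update multiplies by a ratio of nonnegative quantities, F and W remain nonnegative without any explicit projection step.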



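• On optimizing $\tau$: By fixing (F, W) and removing the
  irrelevant terms, the optimization problem (17) becomes

$$
\begin{aligned}
\min_{\tau}\; & \mathrm{Tr}\left[\sum_{l=1}^{L} \tau_l K_l (I - FW)(I - FW)^T\right]
= \mathrm{Tr}\left[\sum_{l=1}^{L} \tau_l K_l Z Z^T\right]
= \sum_{l=1}^{L} \tau_l s_l \\
\text{s.t. } & \tau \geq 0,\; \sum_{l=1}^{L} \tau_l = 1.
\end{aligned}
\tag{23}
$$

  where $Z = I - FW$ and $s_l = \mathrm{Tr}[K_l Z Z^T]$. The
  optimization of (23) with respect to the kernel weights $\tau$ is
  a standard Linear Programming (LP) problem.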
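
Because (23) minimizes a linear function over the probability simplex, an optimum is always attained at a vertex, that is, at a weight vector that puts all mass on the kernel with the smallest cost $s_l$. The sketch below uses this closed form instead of calling a general LP solver. The function name `update_tau` and the random test kernels are assumptions, and ties are broken by taking the first minimizer.

```python
import numpy as np

def update_tau(kernels, F, W):
    """Solve LP (23): min_tau sum_l tau_l s_l over the probability simplex."""
    Z = np.eye(F.shape[0]) - F @ W
    ZZt = Z @ Z.T
    s = np.array([np.trace(K @ ZZt) for K in kernels])   # per-kernel costs s_l
    tau = np.zeros(len(kernels))
    tau[np.argmin(s)] = 1.0          # a vertex of the simplex is optimal
    return tau, s

rng = np.random.default_rng(0)
B = rng.random((3, 8, 8))
kernels = B @ B.transpose(0, 2, 1)   # three random positive semidefinite matrices
F = rng.random((8, 2))
W = rng.random((2, 8))

tau, s = update_tau(kernels, F, W)
# Compare against random feasible weight vectors drawn from the simplex.
candidates = rng.dirichlet(np.ones(3), size=200)
candidate_costs = candidates @ s
```

No feasible candidate achieves a lower cost than `tau @ s`. One consequence worth noting is that, taken literally, this step selects a single base kernel at each iteration.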

3.3.3 Algorithms

The iterative AdapGrNMF algorithm with multiple kernel learning
(named AdapGrNMF_MultiK) is summarized in Algorithm 1.

Algorithm 1 AdapGrNMF_MultiK Algorithm.
Require: L base kernel matrices $K_l$, $l = 1, \cdots, L$;
Require: Initial factorization matrices $F^0$ and $W^0$;
Require: Tolerance stopping criterion $\varepsilon$;

  Initialize the kernel weights as $\tau_1^0 = \cdots = \tau_L^0 = 1/L$;
  Initialize $t = 1$;
  repeat
    Update the graph $\mathcal{G}_{\tau^t}$ and its corresponding
    Laplacian matrix $L_{\tau^t}$ according to $\tau^{t-1}$ as
    introduced in Section 3.2;
    Update the factorization matrices $F^t$ and $W^t$ as in (22);
    Update the kernel weights $\tau^t$ as in (23);
    $t = t + 1$;
  until $O^{AdapGrNMF_{MultiK}}(F^t, W^t, \tau^t) < \varepsilon$
  Output $F = F^{t-1}$, $W = W^{t-1}$ and $\tau = \tau^{t-1}$.
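
The loop of Algorithm 1 can be sketched end to end as follows. This is a compact illustration under several assumptions that go beyond the text: 0-1 weighting on a symmetrized graph, a fixed number `inner` of passes of (22) per outer iteration, the closed-form vertex solution for (23), and a stopping rule based on the change in the objective rather than its absolute value. All function and parameter names are hypothetical.

```python
import numpy as np

def knn_graph_from_kernel(K, P):
    """Symmetric 0-1 P-nearest-neighbor graph from kernel-space distances (15)."""
    d = np.diag(K)
    dist = d[:, None] + d[None, :] - 2.0 * K
    np.fill_diagonal(dist, np.inf)
    nbrs = np.argsort(dist, axis=1)[:, :P]
    A = np.zeros_like(K)
    A[np.repeat(np.arange(K.shape[0]), P), nbrs.ravel()] = 1.0
    return np.maximum(A, A.T)

def adap_grnmf_multik(kernels, R, P=3, alpha=1.0, inner=20,
                      max_iter=10, tol=1e-6, eps=1e-10, seed=0):
    """Alternate graph update, (F, W) updates (22) and kernel weight update (23)."""
    rng = np.random.default_rng(seed)
    n_kernels, N, _ = kernels.shape
    F, W = rng.random((N, R)), rng.random((R, N))
    tau = np.full(n_kernels, 1.0 / n_kernels)       # uniform initial weights
    prev = np.inf
    for _ in range(max_iter):
        K = np.tensordot(tau, kernels, axes=1)      # combined kernel (13)
        A = knn_graph_from_kernel(K, P)             # graph adaptive to tau
        Dg = np.diag(A.sum(axis=1))
        for _ in range(inner):                      # multiplicative rules (22)
            F = F * (K @ W.T) / (K @ F @ W @ W.T + eps)
            W = W * (F.T @ K + alpha * W @ A) / (F.T @ K @ F @ W + alpha * W @ Dg + eps)
        Z = np.eye(N) - F @ W
        s = np.array([np.trace(Kl @ Z @ Z.T) for Kl in kernels])
        tau = np.zeros(n_kernels)
        tau[np.argmin(s)] = 1.0                     # vertex solution of LP (23)
        obj = tau @ s + alpha * np.trace(W @ (Dg - A) @ W.T)
        if abs(prev - obj) < tol:                   # assumption: stop on small change
            break
        prev = obj
    return F, W, tau

rng = np.random.default_rng(1)
X = rng.random((4, 16))
G = X.T @ X
sq = np.diag(G)
kernels = np.stack([
    G,
    (1.0 + G) ** 2,
    np.exp(-(sq[:, None] + sq[None, :] - 2.0 * G) / 2.0),
])
F, W, tau = adap_grnmf_multik(kernels, R=3)
```

The sketch requires base kernels with nonnegative entries (true here because X is nonnegative), since the multiplicative rules rely on nonnegative numerators and denominators.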



4 Experiments 

In this section, we investigate the use of our proposed
AdapGrNMF_MultiK algorithm for document clustering and face
recognition.



4.1 Experiment I: Document Clustering

In this section, we evaluate our proposed AdapGrNMF_MultiK for data
representation in the document clustering task.

4.1.1 TDT2 Document Dataset and Setup 

The first data set is the NIST Topic Detection and Tracking (TDT2)
corpus. The TDT2 corpus consists of data collected during the first
half of 1998 and taken from six sources. It consists of 11,201
on-topic documents, which are classified into 96 semantic
categories. In this experiment, following [4], those documents
appearing in two or more categories were removed and only the
largest 30 categories were kept, leaving us with N = 9,394
documents in total. Each document is represented as a
D = 36,771 dimensional nonnegative feature vector.

We set the dimensionality of the new space R to be the same as the
number of clusters. Assume that the document corpus is comprised of
R clusters, each of which corresponds to a coherent topic. We
project each document's feature vector into an R-dimensional
semantic space in which each axis corresponds to a particular
topic. In this semantic space, each document can be represented as
a linear combination of the R topics using NMF, GrNMF or our
AdapGrNMF_MultiK. We apply the different matrix factorization
algorithms to obtain new data representations W. This step
therefore maps the data from the original space to a low
dimensional (R-dimensional) space. K-means [10] is then applied to
the new data representation W for document clustering. We finally
compare the obtained clusters with the original document categories
to compute the accuracy.

For this dataset, we applied our AdapGrNMF_MultiK algorithm with 10
pre-computed base kernels altogether, i.e.,

• seven RBF kernels $K(x_n, x_m) = \exp\left(-\frac{\|x_n -
  x_m\|^2}{2\sigma^2}\right)$ with $\sigma = const \times g$, where
  g is the maximum distance between samples and const varies in the
  pre-specified range {0.01, 0.05, 0.1, 1, 10, 50, 100},

• two polynomial kernels $K(x_n, x_m) = (1 + x_n^T x_m)^p$ with
  degree $p \in \{2, 4\}$, and

• a cosine kernel $K(x_n, x_m) = \frac{x_n^T x_m}{\|x_n\| \|x_m\|}$.
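
The pool of ten base kernels described above could be precomputed along the following lines. This is a sketch on random data: the function name `base_kernels` is hypothetical, the data matrix is assumed to have no all-zero columns (so the cosine kernel is defined), and at least two distinct samples are assumed (so that g is positive).

```python
import numpy as np

def base_kernels(X, consts=(0.01, 0.05, 0.1, 1, 10, 50, 100), degrees=(2, 4)):
    """Stack of RBF, polynomial and cosine kernel matrices for the columns of X (D x N)."""
    G = X.T @ X                                       # Gram matrix of inner products
    sq = np.diag(G)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * G, 0.0)
    g = np.sqrt(dist2.max())                          # maximum distance between samples
    kernels = [np.exp(-dist2 / (2.0 * (c * g) ** 2)) for c in consts]   # RBF, sigma = c * g
    kernels += [(1.0 + G) ** p for p in degrees]      # polynomial kernels
    norms = np.sqrt(sq)
    kernels.append(G / np.outer(norms, norms))        # cosine kernel
    return np.stack(kernels)

rng = np.random.default_rng(0)
X = rng.random((6, 12)) + 0.1        # nonnegative features, no zero columns
Ks = base_kernels(X)
```

The result `Ks` has shape (10, N, N) and can be passed as the kernel stack wherever a combined kernel of the form (13) is needed.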

4.1.2 Experiment Results

We compare our AdapGrNMF_MultiK algorithm to other state-of-the-art
related clustering algorithms applied to this dataset. The
algorithms that we evaluated are listed below:

• K-means [10] clustering in the original data space;

• Singular Value Decomposition (SVD);

• Normalized Cut (NCut), a typical spectral clustering
  algorithm;

• Original Nonnegative Matrix Factorization based
  clustering (NMF) [6];

• Graph regularized Nonnegative Matrix Factorization
  (GrNMF) [4];

• Our AdapGrNMF_MultiK algorithm.

In order to randomize the experiments, the evaluations are
conducted with different numbers of clusters, varying from 5 to 30.
For a fixed cluster number R, we randomly choose R categories from
the data set and mix the documents of these R categories as the
collection X for clustering. This procedure is repeated 20 times
with different initial points, and the best result in terms of the
clustering accuracy of K-means is recorded. Document vectors were
created using the term frequency and were then clustered using the
new and the old algorithms. The clusters were labeled using the
majority approach, i.e., if most of the documents assigned to a
cluster belong to class C, then the documents of that cluster are
labeled C. After this process, every document has a predicted
class. By comparing the actual classes with the predicted classes,
we obtain the number of correctly clustered documents. The accuracy
of clustering is given by:

$$
\text{Accuracy} = \frac{\text{Number of Correctly Clustered Documents}}{\text{Total Number of Documents}}
\tag{24}
$$
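
One reading of the majority labeling rule and the accuracy (24) is sketched below: each cluster takes the most frequent true class among its members, and a document counts as correct when its cluster's label matches its true class. The function name and the toy labels are illustrative, and this interpretation of the labeling rule is an assumption.

```python
import numpy as np

def clustering_accuracy(true_labels, cluster_ids):
    """Accuracy (24) with each cluster labeled by its majority true class."""
    true_labels = np.asarray(true_labels)
    cluster_ids = np.asarray(cluster_ids)
    correct = 0
    for c in np.unique(cluster_ids):
        members = true_labels[cluster_ids == c]          # true classes inside cluster c
        _, counts = np.unique(members, return_counts=True)
        correct += counts.max()                          # documents matching the majority class
    return correct / true_labels.size

# Toy example: six documents, two classes, two clusters.
true_labels = [0, 0, 0, 1, 1, 1]
cluster_ids = [1, 1, 0, 0, 0, 0]
acc = clustering_accuracy(true_labels, cluster_ids)      # 5 of 6 documents are correct

# Cluster ids are arbitrary, so a perfect clustering with swapped ids still scores 1.
acc_perfect = clustering_accuracy([0, 0, 1, 1], [7, 7, 3, 3])
```

In the toy example, cluster 1 is labeled class 0 (two correct) and cluster 0 is labeled class 1 (three correct out of four), giving 5/6.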
Figure 1. Comparison of clustering accuracy when the
cluster number varies from 5 to 30 on TDT2 dataset.

The average results of all methods are presented in Fig. 1. From
Fig. 1 we have the following observations:

1. NMF is much worse than GrNMF according to the clustering
accuracies over all the 6 cluster number settings, which
demonstrates that the W representation learned in the original
data space performs poorly for clustering. The explanation is that
the data distributions of the TDT2 data set collected in different
topics are quite different. It is interesting to observe that NMF
outperforms SVD and K-means in terms of clustering accuracy, but
NCut is better than NMF.

2. GrNMF is worse than AdapGrNMF_MultiK. The assumption in GrNMF is
that the graph constructed from the original data space can reflect
its manifold structure. When the data distributions of different
features change considerably in cross-topic learning, the optimal
combination coefficients W may not be effectively learned by GrNMF
methods based on the graph from the original domain.

3. GrNMF and AdapGrNMF_MultiK outperform NMF in terms of clustering
accuracy in all the 6 groups of experiments, which demonstrates
that the information from the graph can be effectively used in NMF
to improve the clustering performance on the TDT2 dataset.

4. AdapGrNMF_MultiK is better than GrNMF in terms of clustering
accuracy over all the 6 groups of experiments. Moreover,
AdapGrNMF_MultiK and GrNMF outperform all other methods. These
results demonstrate that the AdapGrNMF_MultiK method can reduce the
mismatch between $\phi(X)$ and HW, as well as the structural risk
functional, through an effective combination of multiple base
kernels. AdapGrNMF_MultiK is better than GrNMF because of the
additional utilization of the adaptive graph. In addition, some
settings enjoy large performance gains. For instance, the accuracy
for the task with 30 clusters increases from 71.90% (NMF) to 93.28%
(AdapGrNMF_MultiK), an absolute improvement of 21.38 percentage
points. Compared with the best result among the existing methods,
AdapGrNMF_MultiK (93.28%) improves on GrNMF (88.60%) by 4.68
percentage points.

4.2 Experiment II: Face Recognition

We also evaluated the performance of AdapGrNMF_MultiK as a feature
representation method in the task of supervised face recognition.


4.2.1 Yale Face Dataset and Setup 

The Yale database contains 165 gray scale images of 15
individuals. There are 11 images per subject, one per different
facial expression or configuration: center-light, w/glasses, happy,
left-light, w/no glasses, normal, right-light, sad, sleepy,
surprised, and wink. Each image is represented by a
1024-dimensional vector in image space.

We divided the facial images into a training set $X_{train}$ and a
test set $X_{test}$. The training set matrix is first decomposed
using AdapGrNMF_MultiK as $\phi(X_{train}) \approx
\phi(X_{train}) F_{train} W_{train}$. Then the feature matrix
$W_{test}$ of the test set $X_{test}$ is computed by least-squares
(LS) projection with respect to $F_{train}$. The kernels
$K_{train} = \phi(X_{train})^T \phi(X_{train})$ and $K_{test} =
\phi(X_{test})^T \phi(X_{train})$ are combinations of 11 RBF kernel
matrices with different bandwidths, obtained from 11 different
values of const in {..., 1, 2, 4, 8, 16, 32}. The feature matrices
are used as features for a Nearest Neighbor (NN) classifier, and
classification accuracies are averaged over 20 independent runs for
comparison.
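
The LS projection of test samples can be written purely in terms of the two kernel matrices: minimizing $\|\phi(X_{test}) - \phi(X_{train}) F_{train} W\|^2$ over W gives the normal equations $(F^T K_{train} F) W = F^T K_{test}^T$. The sketch below solves them with a least-squares routine. It is an unconstrained projection (nonnegativity of $W_{test}$ is not enforced, which is an assumption about what "LS projection" means here), and the function name and the synthetic linear-kernel data are illustrative.

```python
import numpy as np

def project_test(K_train, K_test, F):
    """Least-squares codes W_test for test samples, using kernel matrices only.

    K_train: N_train x N_train, K_test: N_test x N_train, F: N_train x R.
    """
    G = F.T @ K_train @ F          # R x R Gram matrix of the basis H = phi(X_train) F
    B = F.T @ K_test.T             # R x N_test inner products of basis and test samples
    return np.linalg.lstsq(G, B, rcond=None)[0]

rng = np.random.default_rng(0)
X_train = rng.random((8, 12))      # linear kernel, so phi is the identity map
F = rng.random((12, 3))
W_true = rng.random((3, 5))
X_test = X_train @ F @ W_true      # test samples lying exactly in the span of the basis

K_train = X_train.T @ X_train
K_test = X_test.T @ X_train
W_test = project_test(K_train, K_test, F)
```

Because the synthetic test samples lie in the span of the learned basis, the projection recovers `W_true`. Columns of `W_test` and `W_train` can then be compared by a nearest neighbor rule for classification.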



4.2.2 Experiment Results 

In this experiment, we compare our algorithm AdapGrNMF_MultiK with
the following related algorithms:

• Single kernel-based NMF (NMF_K);

• Multiple kernel-based NMF (NMF_MultiK).

Figure 2. Comparison of recognition accuracy when the
intrinsic dimension R varies from 20 to 200, in the case of
Train-4 on Yale dataset.
Fig. 2 shows the classification accuracy when the value of R varies
from 20 to 200. The proposed AdapGrNMF_MultiK consistently
outperforms NMF_MultiK for all numbers of features R. To further
show the consistency of the accuracy improvement of
AdapGrNMF_MultiK, Fig. 3 plots the classification accuracy for
different numbers of training samples. For each individual in the
dataset, we randomly assigned 2, 3, or 4 images to the training set
and the remaining images to the test set. AdapGrNMF_MultiK
consistently outperforms NMF_MultiK for all training sample
numbers.

Although all the tested algorithms have the same number of basis
vectors R, Fig. 2 shows that AdapGrNMF_MultiK outperforms all other
algorithms for all R, with larger R giving higher recognition
accuracy, which is strong evidence of the power of multiple kernels
and graph regularization. However, if the dimension is reduced too
far, NMF_MultiK may not outperform NMF_K, as shown in Fig. 2.
Fig. 2 also shows that excessive dimension reduction by NMF or
NMF_K results in a sharp increase of the classification error,
while NMF_MultiK and AdapGrNMF_MultiK are much less sensitive to
the number of features. Note that although NMF_K maps the data into
a nonlinear data space, NMF_K with the kernel selected via
cross-validation performs very badly when applied directly to the
original 1024-dimensional space, as shown in Fig. 2 and Fig. 3.
However, after applying NMF_MultiK or AdapGrNMF_MultiK to remove
the unreliable kernels, NMF_MultiK and AdapGrNMF_MultiK outperform
NMF and NMF_K.

Figure 3. Comparison of recognition accuracy when the
training sample number varies from 2 to 4 on Yale dataset.



5 Conclusion 

There has been an increasing interest in the nonnegative matrix
factorization method in the past few years due to its capability of
retrieving human-intelligible features. Data analysis is a complex
task, especially when high-dimensional and noisy data are used, so
any method that eases the interpretation of the data becomes very
appealing. The method presented here is an attempt to improve the
ability of the classical NMF algorithm. The proposed adaptive graph
regularized NMF with multiple kernel learning was the most accurate
among the algorithms compared in our experiments. It surpassed the
clustering accuracy of the best-scoring competing approach, GrNMF,
in all cluster number settings (by 4.68 percentage points for 30
clusters), because of the additional utilization of the adaptive
graph. The main feature of this algorithm is that it alternately
optimizes (F, W) and $\tau$. At each iteration, one of (F, W) and
$\tau$ is optimized while the other is fixed, and then the roles of
(F, W) and $\tau$ are switched. Iterations stop when the required
accuracy is reached, which will allow us to refine the clustering
as our processing resources increase or as our experience in the
field matures.

In the future, a more detailed investigation of the theoretical and
biological basis is desired. The proposed AdapGrNMF_MultiK
algorithm can also be applied to other areas, such as
bioinformatics [30, 14, 28, 35, 22], medical imaging, biometrics
[33, 31, 11, 36] and computer vision [32, 38, 29].